Cognitive fatigability assessment test (cFAST): Development of a new instrument to assess cognitive fatigability and pilot study on its association to perceived fatigue in multiple sclerosis

Background. Fatigue is a common symptom of many diseases, including multiple sclerosis. It manifests as a cognitive or physical condition. Fatigue is poorly understood, and effective therapies are missing. Furthermore, there is a lack of methods to measure fatigue objectively. Fatigability, the measurable decline in performance during a task, has been suggested as a complementary method to quantify fatigue.
Objective. To develop a new and objective measurement of cognitive fatigability and investigate its association with perceived fatigue.
Methods. We introduced the cognitive fatigability assessment test (cFAST), a novel smartphone-based test to quantify cognitive fatigability. Forty-two people with multiple sclerosis (23 fatigued and 19 non-fatigued, defined by the Fatigue Scale for Motor and Cognitive Functions) took part in our validation study. Patients completed cFAST twice. We used t-tests, Monte Carlo sampling, and area under the receiver operating characteristic curves to evaluate our approach using two sets of proposed metrics.
Results. When classifying fatigue, our fatigability metric Δresponse time has a mean area under the receiver operating characteristic curve of 0.74 (95% CI 0.64–0.84), making it the best performing metric for this task. Furthermore, Δresponse time shows a statistically significant difference between the fatigued and non-fatigued groups (t = 2.27, P = .03). In particular, cognitively fatigued patients decline in performance, while non-fatigued patients do not.
Conclusions. We introduce cFAST, a new instrument to quantify cognitive fatigability. Our pilot study provides evidence that cFAST produces a quantifiable drop in cognitive performance in a short period. Furthermore, our results indicate that cFAST may have the potential to serve as a surrogate for subjective cognitive fatigue. cFAST is significantly shorter than the existing fatigability assessments and does not require specialized equipment. Thus, it could enable frequent and remote monitoring, which could substantially aid clinicians in better understanding and treating fatigue.


Introduction
Background
Fatigue is a highly prevalent and devastating symptom of many diseases, including Parkinson's disease, 1 multiple sclerosis (MS), 2 and more recently, post-COVID syndrome. 3 In MS, fatigue is rated as the most frequent and debilitating symptom. 2,4,5 Fatigue has been defined as the subjective feeling of overwhelming exhaustion and tiredness and can manifest as a physical and cognitive symptom. 6 The symptom is still poorly understood, and its severity can only be assessed subjectively. Currently, this is done using questionnaires such as the Fatigue Severity Scale (FSS), 7 Modified Fatigue Impact Scale (MFIS), 8 and Fatigue Scale for Motor and Cognitive Functions (FSMC). 9 More than a dozen fatigue questionnaires are available. 10 These questionnaires are used as patient-reported outcome measures in clinical trials. Their heterogeneity and subjective nature make it difficult to use them as outcome measures in clinical trials and to compare efficacy results across different studies. Randomized placebo-controlled clinical trials testing different compounds for treating fatigue have produced contradictory results, with some showing good efficacy and others exhibiting no effect. [11][12][13][14][15][16][17][18]

Fatigue and fatigability
The perception of fatigue (subjective measurement) is differentiated from performance fatigability (objective measurement). 19 Fatigability is further divided into the motor and cognitive domains. Motor fatigability has been quantified as the decline in peak performance, power, or speed during physical activity. 20 On the other hand, cognitive fatigability measures the decline of cognitive performance during a task that requires sustained attention, 20 and it has been measured as an increase in reaction time, a decline in accuracy, or by comparing the performance during the first and last third of a task. 21,22 Establishing an association between objective fatigability and subjective fatigue is an important goal for clinical research but has proven difficult. 19 While a correlation between motor fatigability and perceived fatigue has been suggested in several studies, [23][24][25][26][27] less data is available on cognitive fatigability. [28][29][30][31][32][33][34][35] A possible cause is the complexity of inducing cognitive fatigability and the lack of consensus and dedicated tests to quantify it. 22 Prior studies used one of two strategies to generate cognitive fatigability. Either they conducted a test battery, including the same test before and after fatiguing tasks, and compared the performance on both administrations, or they employed a single prolonged cognitive task and measured the decline in performance within the task. Cognitive tests used in fatigability research include: (1) the Paced Auditory Serial Addition Test (PASAT), 36 (2) the Psychomotor Vigilance Task (PVT), 37 and (3) the Stroop test. 38 However, utilizing these non-specific cognitive performance tests to assess cognitive fatigability comes with certain drawbacks, such as long testing sessions.

Limitations of cognitive fatigability studies
Fatigability in healthy subjects is typically studied through long examination sessions. Van der Linden et al. induced fatigue through two hours of cognitively demanding tasks. Their study showed a significant difference in planning ability and increased perseverative errors between the non-fatigued and fatigued participants. 32 Other cognitive fatigability studies in healthy subjects using the Stroop test employed a study length of 3 and 2 h for young adults 34 and for older adults, 35 respectively. However, long testing sessions are not unique to healthy subjects. Moeller et al. administered two hourly test batteries for analyzing cognitive fatigability using three neuropsychological tests in subjects with mild traumatic brain injury. 33
In MS, there is large heterogeneity when it comes to studying cognitive fatigability. DeLuca et al. 39 studied fatigue in 15 people with MS (pwMS) and 15 controls by conducting four modified Symbol Digit Modality Test (mSDMT) trials over an hour of fMRI scanning, in which participants were shown different symbol-digit pair probes at varying interstimulus intervals. Participants had to respond "match" or "no match" to each probe by following a provided symbol-digit arrangement. The interstimulus interval randomly varied between 0, 4, 8, and 12 s. Their study found no cognitive fatigability. Chen et al. 40 also studied fatigability using an mSDMT within an fMRI setting. During the examination, pwMS and controls completed a total of eight mSDMT trials (four with high cognitive load and four with low cognitive load), each lasting 7.7 min. The authors did not study within-trial performance, but across-trial performance showed an increase in reaction time associated with subjective fatigue in pwMS. Berard et al. compared the performance during quintiles of a 20 min PVT session to quantify cognitive fatigability and found a greater increase in reaction time in patients compared to healthy controls. 31 The PVT is a simple reaction time task where participants have to press a button in response to the presence of a stimulus. However, its repetitive and monotonous nature often results in participants reporting feelings of boredom, 41 and thus the performance decline may be influenced by a lack of motivation rather than fatigability. 42
Finally, several authors employed the PASAT by comparing the decrease in accuracy between the beginning and end of the test. [28][29][30][42][43][44] Even though the PASAT is applied in many studies, there is still significant methodological heterogeneity. First, some studies compared the performance between the first and the second half of the test, 28 while others compared the performance between thirds. 29 Second, although there seems to be a general consensus on an inter-stimulus interval (ISI) of 3 s, this has not been uniformly applied in fatigability studies. 20,30,44 Third, it is known that pwMS may adopt a "chunking strategy," particularly as task demands increase, 45 meaning that they add two numbers, skip one, and add the following two, thus reducing the overall difficulty of the task by decreasing the simultaneous cognitive load. Only recently have the first normative data on cognitive fatigability been generated that account for the chunking strategy. 43 Fourth, the PASAT requires a medical examiner to conduct the test, making it more expensive to administer. Finally, patients have described the PASAT as unpleasant and causing anxiety, 46 limiting the applicability and repeatability of the test.

Aims and overview of the study
We propose a new test for measuring cognitive fatigability in a short period (i.e. 5 min) and refer to it as the Cognitive Fatigability Assessment Test (cFAST). cFAST is inspired by the Symbol Digit Modality Test (SDMT) digit-symbol matching logic. 47 SDMT is a cognitive test that measures information processing speed. Studies showed that the SDMT is relatively resistant to practice effects, 48 in particular when rearranging the keys, 49 making it an attractive tool for cognitive monitoring over time in clinical trials. 50 Moreover, it has also been validated for smartphones. 51 Our study uses a similar key-symbol matching strategy to measure fatigability instead of cognitive impairment.
The goals of this feasibility study were two-fold. The first goal was to develop an objective and ubiquitous measurement of cognitive fatigability. We achieved this goal by implementing a smartphone-based test through an iterative process involving patients, neuropsychologists and neurologists. We opted for a smartphone-based implementation given the high acceptability and interest of pwMS in smartphone-based tools that allow them to monitor and manage their condition. [52][53][54][55][56][57][58][59] The second goal was to study the association between the newly developed objective measurement (cFAST) and perceived cognitive fatigue. We approached this goal by conducting a pilot study with pwMS who completed the cFAST and the FSMC. 9 Using the FSMC cognitive subscale, we assigned the participants to the cognitive-fatigued (subscale ≥ 22) and non-cognitive-fatigued (subscale < 22) groups. 9 From the cFAST, we extracted a set of metrics and evaluated group differences with t-tests. Using the area under the receiver operating characteristic curve (AUROC), we assessed the performance of our proposed metrics in classifying cognitive-fatigued versus non-cognitive-fatigued patients. Furthermore, we investigated the relationship of our proposed test (cFAST) and metrics to disability.

Development of the smartphone-based test, cFAST
We aimed to develop a test to objectively quantify cognitive fatigability that meets the following requirements: (1) it engages cognitive processing speed and induces cognitive load, (2) it is short, self-explanatory, and allows for remote monitoring, and (3) it does not require medical supervision. We followed an iterative process during the design and development of the application. The medical professionals reviewed different prototypes to ensure that the implemented design is appropriate with respect to clinical theory and practice. Additionally, we gathered informal feedback from people with MS (pwMS) regarding our prototypes before converging on our final design. Refer to the Supplements for further details on the prototype designs and selection. Figure 1 displays the user interface of the cFAST and highlights each of its elements. The test is designed to be carried out by holding the smartphone in landscape mode.
The middle of the screen shows a large blue symbol (main symbol). The main symbol has to be mapped to its corresponding digit following the mapping rule displayed at the top of the screen. Selection occurs by tapping the numbers located at the bottom of the screen. Users have a limited time to find the corresponding number associated with the main symbol. A yellow progress bar around the symbol indicates how much time is left until the symbol is changed automatically. The main symbol changes under two circumstances: (1) after the user taps a number or (2) when the progress bar has entirely run out. Every time a new symbol appears, the associations and positions of the top mapping rule are randomized, and the progress bar is restarted. The randomization seeks to diminish the possibility of a learning effect associated with memorizing the digit-symbol mapping within the same test run. The progress bar works as a pressure mechanism to motivate users to be fast and avoid resting periods. A timer located at the top left indicates how much time is left for the test to end. Users can exit the test at any moment by tapping the exit button located at the top right corner. If exited early, the test is considered invalid. Our test is inspired by the SDMT, 47 as it is a widely used, accepted, and validated cognitive assessment test in MS. However, cFAST differs from the SDMT in several aspects:
1. cFAST is a cognitive fatigability test, while SDMT assesses cognitive impairment and working memory.
2. Contrary to the SDMT, cFAST does not allow participants to look ahead to match the following symbols. Hence, participants have no way to anticipate the next answer to reduce their response time.
3. There is a time limit to complete each selection in cFAST.
4. cFAST randomizes the matching rules after each answer, while SDMT has a fixed matching rule.
5. The duration of a cFAST session is 5 min, while the SDMT lasts 90 s. The increased duration is needed because cognitive fatigability is notoriously hard to elicit in a short time. However, cFAST is significantly shorter than previous attempts at measuring cognitive fatigability.
All these design considerations seek to evaluate cognitive fatigability.

Application logic
cFAST is designed to be conducted outside the clinic and without medical supervision. Therefore, the application logic is self-explanatory and contains a personalization phase to maximize the users' understanding and tailor the test to their performance. This phase needs to be completed before cFAST can be run. Figure 2 depicts the application logic diagram. At the start of the personalization phase, users are prompted for a mandatory two-minute preparation step. The goal of this step is for users to familiarize themselves with the test matching logic and rules before starting the calibration step. To this end, a confirmation step ensures that, during the preparation, users provided at least 70% correct digit-symbol matches out of a minimum of 20 answers. Contrary to the calibrated cFAST, there is no time limit to match individual symbols during preparation. Hence, symbols only change after the user presses a number from the selection panel. We refer to this method as manual. This functionality allows users to understand the test matching logic without time pressure.
During the preparation, users receive immediate feedback on whether their choice is correct or incorrect through a label located at the left side of the screen (cf. Figure 3). Failed preparation trials indicate that the user has not sufficiently trained in operating the test yet or did not perform it as fast as possible and thus must repeat it. The motivation for providing immediate feedback is to help the user understand the matching mechanics of the test. This functionality is particularly beneficial for unsupervised settings where no medical examiner is present to clarify doubts to the patients. Users can start the calibration step only after preparation is passed successfully. The calibration step lasts one minute and it uses the same logic of the preparation step, but without providing feedback. At this point, we assume users understand the test matching logic. Similar to preparation, calibration also employs a manual mechanism. However, its goal is to extract the users' reaction time, which we call calibrated rate. This rate is then used in cFAST. Thus, the manual function of the application has two goals: (1) during the preparation it allows sufficient time for users to understand the test matching logic, and (2) during calibration, it helps derive a personalized calibrated rate.
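The confirmation rule described above (at least 70% correct matches out of a minimum of 20 answers) can be sketched as follows. This is a minimal illustration, not the application's actual code: the function name `preparation_passed` and the representation of answers as a list of booleans are assumptions made for the example.

```python
def preparation_passed(answers, min_answers=20, min_accuracy=0.70):
    """Return True if a preparation run satisfies the confirmation rule.

    `answers` is a hypothetical list of booleans, one per digit-symbol
    match given during preparation (True = correct match).
    """
    # Too few answers: the user has not practiced enough (or was too slow).
    if len(answers) < min_answers:
        return False
    accuracy = sum(answers) / len(answers)
    return accuracy >= min_accuracy


# Example: 15 correct out of 20 answers (75%) passes; 13 out of 20 (65%) fails.
print(preparation_passed([True] * 15 + [False] * 5))
print(preparation_passed([True] * 13 + [False] * 7))
```

A run with fewer than 20 answers fails regardless of accuracy, which matches the statement that a failed preparation can also indicate that the user did not perform the step as fast as possible.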
Deriving calibrated rate. The calibrated rate is a key feature of cFAST and it is derived from the 1-min calibration step of the personalization phase (Figure 2). The calibration step has the same logic as the preparation but without user feedback. During calibration, symbols are only changed once the user taps a number from the selection panel (manual mechanism). We use the 85th percentile of the response times exhibited during the calibration step as the calibrated rate, meaning each individual user may perform the task at a different rate but always in relation to their top performance. Thus, the calibrated rate is tailored to each user, accounting for patients' different levels of disability. Once the calibrated rate is derived, cFAST is personalized and ready to use.
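As an illustration of the percentile-based derivation, the sketch below computes a calibrated rate from a list of calibration response times. The text does not state which percentile interpolation rule is used, so linear interpolation between neighboring sorted values is an assumption here, as are the function names.

```python
import math


def percentile(values, q):
    """q-th percentile (0-100) using linear interpolation between ranks.

    The interpolation rule is an assumption; the source only states that
    a percentile of the calibration response times is used.
    """
    if not values:
        raise ValueError("at least one response time is required")
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100
    lower = math.floor(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


def calibrated_rate(calibration_response_ms, q=85):
    """Per-symbol time limit (ms) derived from the calibration step."""
    return percentile(calibration_response_ms, q)


# Example: 21 hypothetical response times from 1000 ms to 3000 ms.
times = [1000 + 100 * i for i in range(21)]
print(calibrated_rate(times))
```

Because the limit is taken from each user's own calibration data, a slower user receives a proportionally longer time window per symbol.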
Eliciting cognitive fatigability. During a cFAST session, users are supposed to repeatedly match a symbol with its corresponding number. However, tasks of this nature are typical examples of the speed-accuracy trade-off. 60 Participants tend to decide between performing the test with high accuracy but slowly (i.e. low exertion) or fast but with low accuracy. Either of these scenarios would significantly limit the fatigue-inducing effect of the test. With cFAST, we seek to reduce this trade-off by adding a limited timeframe (calibrated rate) for each selection. This timeframe is indicated through a yellow progress bar (Figure 4). With this approach, participants cannot spend unlimited time making a decision. Moreover, we hypothesize that the added pressure to make a fast selection contributes to the cognitive load required to induce cognitive fatigability.

Participants
We recruited 48 patients from the MS outpatient clinic of the Department of Neurology, University Hospital Zurich, between September 2020 and April 2021. Participants provided written consent following the Declaration of Helsinki. 61 The expanded disability status scale (EDSS) was obtained from the routine neurologic examinations performed at the hospital. This study was approved by the local ethics committee (Cantonal Ethics Committee Zurich, Switzerland). Inclusion criteria consisted of: (a) confirmed MS diagnosis and (b) age between 18 and 70. Exclusion criteria included: diagnosis of depression, schizophrenia, bipolar disorders, attention deficit hyperactivity disorder, and regular intake of psychostimulants or anticonvulsant medications.

Procedure
Participants were briefly introduced to the study setup and completed a demographic questionnaire. Next, the study examiner showed them the application and the logic of the cFAST. Participants started with the 2-min preparation session. After successful completion, they performed the calibration step. We then asked participants to complete a first 5-min cFAST session, which was considered a trial, to ensure they understood the test logic. Afterwards, there was a short break in which participants filled out the FSMC questionnaire. Participants then performed a second cFAST session. Previous cognitive fatigability studies using modified versions of the SDMT ran full trials and discarded these data before conducting the actual test to ensure participants understood the test logic. 40 Hence, all data analyses presented in this paper are based on the second (main) cFAST session and not on the trial data.

Data collection and processing pipeline
We collected touch data from the smartphone using a custom Android application that we developed. Each sample in our dataset contains the ID of the symbol to be matched, the user's selection if there was any, the current mapping rule, and the timestamp of the touch-down event. Our data processing pipeline includes three steps: (1) artifact detection, (2) cognitive adaptation removal, and (3) metrics extraction.

Artifact detection
We use response time as one of our primary performance metrics. Artifacts in response time typically appear when a user aims at tapping a digit to match the current symbol but runs out of time. Hence, the newly displayed symbol is stored with a short response time, and the previous symbol is marked as a missed answer (Figure 5 left). These artifacts need to be identified and removed to avoid double-counting errors and computing a misleading response time. Therefore, in our preprocessing step, we remove any entry that follows a missed answer and has a response time of less than the average minus two standard deviations of the entire cFAST session's response time. This results in subject-specific thresholds that account for the difference in average performance. With this method, we remove an average of 3.8 entries per session, with the average session containing 138 answers. Figure 5 right shows the same data after artifact removal.
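The removal rule can be sketched as a small filter. This is an illustrative reading of the description, not the study's pipeline: the entry layout (dictionaries with `response_ms` and `missed` keys) is hypothetical, and the use of the sample standard deviation over all entries of the session is an assumption, since the text does not specify these details.

```python
from statistics import mean, stdev


def remove_artifacts(entries):
    """Drop entries that directly follow a missed answer and are implausibly fast.

    `entries` is a hypothetical chronological list of dicts with keys
    'response_ms' (number) and 'missed' (bool). The threshold is the
    session mean minus two standard deviations of the response time.
    """
    times = [entry["response_ms"] for entry in entries]
    threshold = mean(times) - 2 * stdev(times)

    cleaned = []
    for index, entry in enumerate(entries):
        follows_miss = index > 0 and entries[index - 1]["missed"]
        if follows_miss and entry["response_ms"] < threshold:
            continue  # late tap that was credited to the newly shown symbol
        cleaned.append(entry)
    return cleaned


# Example: a timeout (missed answer) followed by a 100 ms "answer".
normal = {"response_ms": 2000, "missed": False}
session = ([normal] * 10
           + [{"response_ms": 2500, "missed": True},
              {"response_ms": 100, "missed": False}]
           + [normal] * 10)
print(len(session), len(remove_artifacts(session)))
```

Because the threshold is computed per session, a participant who is slow overall is not penalized for answers that would look fast only relative to other participants.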

Cognitive adaptation removal
Previous cognitive fatigue studies describe the existence of an adaptation phase occurring at the beginning of a cognitive task due to some unspecific modulations of training and adaptation and highlight the need to account for these effects when studying fatigue. 62,63 A common strategy to deal with the adaptation in cognitive fatigue studies is to omit the start of the task. 62,63 An adaptation phase is not unique to cognitive tasks as it has also been detected in motor fatigability tasks. A similar strategy is applied in motor tasks by removing the start of the task to account for the adaptation period. 27,64,65 cFAST sessions exhibit an adaptation period in the initial part of the test, in particular for fatigued patients. Figure 6 depicts the average mean-normalized response times for all fatigued pwMS for the whole 5 min cFAST in 30 s segments. During the first segments, we observe an increase in response time, followed by a decrease in response time in the third segment. We attribute these changes in performance to an adaptation period before users are fully immersed in the test. 62,63 Hence, to make a fair comparison between the study participants, we discard the first 60 s of all cFAST tests (42 sessions) before extracting the metrics and performing the data analysis.
Figure 2. cFAST application logic. In the personalization phase, users complete the preparation and confirmation to ensure they understand the test's matching logic and the calibration to derive the calibrated rate used in cFAST. After this phase, cFAST is personalized and ready to be used. Note. cFAST, cognitive fatigability assessment test; CR, calibrated rate.

Metrics extraction
We define two sets of metrics to quantify performance during a cFAST test session: (1) general metrics, which represent the average performance during an entire test session, and (2) fatigability metrics, which measure the change in performance occurring between the first third and last third of a test session. Table 1 displays an overview of the proposed metrics with their definition.
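To make the fatigability metrics concrete, the following sketch computes the percent change between the first and last third of a session after discarding the first 60 s. Several details are assumptions made for illustration: the tuple layout of an answer, splitting the remaining 240 s into thirds by time rather than by answer count, counting every non-correct answer as an error, and all function and key names.

```python
from statistics import mean

ADAPTATION_S = 60.0  # discarded adaptation period at the start of the session
SESSION_S = 300.0    # total duration of a cFAST session (5 min)


def percent_change(first, last):
    """Percent change from `first` to `last`; None if `first` is zero."""
    if first == 0:
        return None
    return 100.0 * (last - first) / first


def fatigability_metrics(answers):
    """Compute the delta metrics for one session.

    `answers` is a hypothetical list of (time_s, response_ms, is_correct)
    tuples, where time_s is the time since the start of the session.
    """
    third = (SESSION_S - ADAPTATION_S) / 3.0
    first = [a for a in answers if ADAPTATION_S <= a[0] < ADAPTATION_S + third]
    last = [a for a in answers if a[0] >= SESSION_S - third]
    if not first or not last:
        raise ValueError("both the first and the last third need answers")

    def summarize(segment):
        response_time = mean([a[1] for a in segment])
        correct = sum(1 for a in segment if a[2])
        errors = sum(1 for a in segment if not a[2])
        return response_time, correct, errors

    rt_first, correct_first, errors_first = summarize(first)
    rt_last, correct_last, errors_last = summarize(last)
    return {
        "delta_response_time": percent_change(rt_first, rt_last),
        "delta_correct": percent_change(correct_first, correct_last),
        "delta_errors": percent_change(errors_first, errors_last),
    }


# Example: response time rises from 2000 ms to 2200 ms (a 10% increase).
demo = [(10, 5000, True), (70, 2000, True), (100, 2000, False),
        (150, 2100, True), (250, 2200, True), (280, 2200, False)]
print(fatigability_metrics(demo))
```

A positive `delta_response_time` indicates slowing over the session, which is the direction associated with fatigability in the text.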

Statistical analyses
We use descriptive statistics to summarize and compare the study subpopulations. We evaluate the performance of our derived smartphone-based metrics to discriminate between cognitive-fatigued and non-cognitive-fatigued subjects following the FSMC cognitive subscale (threshold = 22). 9 With t-tests, we explore group differences and consider P < .05 significant. Furthermore, through AUROC, we evaluate the performance of our derived smartphone-based metrics to classify cognitive-fatigued versus non-cognitive-fatigued subjects, independently of age and EDSS. We assess the robustness of our approach and compute confidence intervals for AUROC using stratified Monte-Carlo sampling 66 with 1000 iterations; in each iteration, we randomly select (without replacement) half of our participants' data (cFAST sessions) for evaluation. We partition the cFAST data into eight strata, following two partitioning criteria: (a) cognitive fatigued as a binary state according to the FSMC cognitive subscale (threshold = 22) and (b) an EDSS group, which can be one of four: [0,1), [1,2), [2,3), and [3,∞). The idea of this partition is to find a metric that works best across the whole spectrum of disability. Each participant and their data are fully assigned to one of the resulting eight strata. Thus, when performing the stratified split, either a participant's data is fully contained in the split or not at all. Hence, with our approach, we split at the participant level, ensure class balance, and account for disability. Additionally, as age also influences cognitive performance, 33 we create eight additional strata following two partitioning criteria: (a) cognitive fatigued as a binary state and (b) age group, which can be one of four: [18,30), [30,40), [40,50) and [50,70]. This partition aims at reducing the influence of age on the metrics by assigning weights according to the group sizes.
Furthermore, we use one-way analysis of covariance (ANCOVA) with EDSS as a covariate to rule out the effect of disability when analyzing fatigue.
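The stratified Monte-Carlo evaluation of AUROC described above could look roughly like the sketch below. It is an approximation under stated assumptions rather than the study's implementation: the participant records (dicts with `fatigued`, `edss`, and a metric key) are hypothetical, only the EDSS-based strata are shown, the age-based weighting is omitted, half of each stratum is rounded up, and the confidence interval is taken as the empirical 2.5th and 97.5th percentiles of the sampled AUROC values, which the text does not specify.

```python
import random
from collections import defaultdict


def auroc(scores, labels):
    """AUROC as the probability that a positive outscores a negative (ties count 0.5)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes are required")
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def edss_group(edss):
    """Map EDSS to one of the four bins [0,1), [1,2), [2,3), [3, inf)."""
    return min(int(edss), 3)


def monte_carlo_auroc(participants, metric, iterations=1000, seed=0):
    """Mean AUROC and an empirical 95% interval over stratified half-samples."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for person in participants:
        strata[(person["fatigued"], edss_group(person["edss"]))].append(person)

    values = []
    for _ in range(iterations):
        sample = []
        for members in strata.values():
            half = (len(members) + 1) // 2  # half of the stratum, rounded up
            sample.extend(rng.sample(members, half))  # without replacement
        labels = [person["fatigued"] for person in sample]
        if len(set(labels)) < 2:
            continue  # AUROC is undefined when only one class was drawn
        values.append(auroc([person[metric] for person in sample], labels))

    values.sort()
    mean_auroc = sum(values) / len(values)
    lower = values[int(0.025 * (len(values) - 1))]
    upper = values[int(0.975 * (len(values) - 1))]
    return mean_auroc, (lower, upper)


# Small pairwise check: three of the four positive-negative pairs are ordered correctly.
print(auroc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))
```

Sampling whole participants within each stratum keeps a participant's data entirely inside or outside a given sample, in line with the participant-level split described in the text.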
Finally, we further explore how cFAST and our proposed metrics relate to disability by measuring the performance of the metrics to rate disability according to EDSS. To this end, we split the study participants into two groups according to EDSS and analyzed the difference in performance between both groups. We classify patients with EDSS > 1.5 as disabled and patients with EDSS ≤ 1.5 as not disabled. For this evaluation, we partition our dataset into four strata, following two partitioning criteria: (a) disabled as a binary state according to the EDSS (0 for EDSS ≤ 1.5 and 1 for EDSS > 1.5), and (b) cognitive fatigued as a binary state according to the FSMC cognitive subscale (threshold = 22). Additionally, we use the same age groups as we did for the cognitive fatigue evaluation. We report the average AUROC with 95% confidence intervals. In addition, we include plots of the ROC curves for visual inspection.
Figure 5. Artifacts in response time typically appear when a user provides an answer shortly after running out of time. Therefore, the pressed digit is associated with the newly displayed figure. As a result, the previous entry is classified as a missed answer, and the current figure has a very short response time (left side). We detect and remove these artifacts to avoid misleading errors and response time values (right side).
Fatigability metrics:
Δcorrect: Percent change in correct between the first and the last third of the task.
Δresponse time: Percent change in response time between the first and the last third of the task.
Δerrors: Percent change in errors between the first and the last third of the task.

Participant characteristics
We recruited 48 study participants and from those we excluded 6 due to comorbidities including iron deficiency, personality disorder, hypothyroidism, and narcolepsy type 1. Table 2 summarizes the study participants divided into the two subgroups of interest (i.e. no cognitive fatigue and cognitive fatigue according to the FSMC subscore).
Of the recruited pwMS, 21 did not have cognitive fatigue and 27 were cognitively fatigued. Of those we included in our analysis, 19 participants did not have fatigue and 23 were fatigued. Figure 7 shows the flow chart of the study and an overview of the excluded patients. The gender distribution of the participants in the two groups, the mean and standard deviation of their age, EDSS, and the FSMC subscales are listed in Table 2. As expected, we found a significant difference in all the FSMC scores. However, we found no statistically significant difference between the age and gender distributions of the two groups.

Correlation to clinical data
Our analysis indicates a significant Spearman rank correlation between several of the proposed general metrics and the clinical data. Table 3 shows an overview of all the computed correlations. The response time and correct metrics showed the highest correlation with EDSS (ρ = 0.6, P < .001 and ρ = -0.6, P < .001, respectively). Then, calibrated rate follows with ρ = 0.5, P = .001. On the other hand, errors did not significantly correlate to EDSS (ρ = -0.07, P = .67). We also found a significant correlation when analyzing the relationship between our metrics and the FSMC cognitive subscore. Again, response time and correct showed the highest correlation to the FSMC subscore (ρ = 0.39, P = .01 and ρ = -0.38, P = .01, respectively). Neither calibrated rate (ρ = 0.27, P = .09) nor errors (ρ = 0.1, P = .51) significantly correlated to the FSMC cognitive subscore. Age also correlates to the proposed general performance metrics. Among the correlating metrics, we found correct (ρ = -0.66, P < .001), response time (ρ = 0.61, P < .001), and calibrated rate (ρ = 0.51, P = .001). We found no significant correlation between the fatigability metrics and the clinical data.

cFAST relationship to perceived fatigue
We investigated the relationship between our metrics and perceived fatigue by determining statistically significant differences between the cognitive-fatigued and nonfatigued groups. Table 4 depicts a complete overview of the metrics' mean value and standard deviation for both groups, as well as the t-test results. Table S2 in the Supplements includes the non-parametric testing results using Mann Whitney U. We found a significant difference between both groups regarding response time (t = 2.16, P = .04, d = 0.669). The group with cognitive fatigue had an average response time of 2586.88 (SD = 961.28) ms, compared to the 2083.3 (SD = 358.31) ms of non-fatigued participants. We did not find a statistically significant difference in calibrated rate (t = 1.54, P = .13). Furthermore, we found that correct differed significantly between the groups (t = -2.8, P = .008, d = -0.868). The non-fatigued participants gave an average of 109.11 (SD = 15.97) correct answers, while the fatigued group had an average of 90.96 (SD = 24.21) correct answers. However, errors was not significantly different between the groups (t = 0.29, P = .77).
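As a worked check of how the reported statistics relate to each other, the sketch below recomputes the effect size and t statistic for response time from the summary values given above (group means and standard deviations) and the group sizes stated earlier (23 fatigued, 19 non-fatigued). The text does not state which t-test variant or effect size formula was used; the pooled-variance (Student) t-test and Cohen's d with a pooled standard deviation are assumptions, chosen because they reproduce the reported values to rounding.

```python
import math


def pooled_sd(sd1, n1, sd2, n2):
    """Pooled standard deviation of two independent groups."""
    return math.sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2))


def cohens_d(mean1, sd1, n1, mean2, sd2, n2):
    """Cohen's d using the pooled standard deviation (assumed formula)."""
    return (mean1 - mean2) / pooled_sd(sd1, n1, sd2, n2)


def student_t(mean1, sd1, n1, mean2, sd2, n2):
    """Two-sample t statistic with pooled variance (assumed test variant)."""
    return cohens_d(mean1, sd1, n1, mean2, sd2, n2) / math.sqrt(1 / n1 + 1 / n2)


# Response time (ms): fatigued group versus non-fatigued group, as reported in the text.
fatigued = (2586.88, 961.28, 23)
non_fatigued = (2083.3, 358.31, 19)

d = cohens_d(*fatigued, *non_fatigued)
t = student_t(*fatigued, *non_fatigued)
print(round(d, 3), round(t, 2))  # close to the reported d = 0.669 and t = 2.16
```

The same functions can be applied to the other metrics in Table 4 to see how group means, standard deviations, and group sizes combine into the reported t and d values.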
In terms of the fatigability metrics, we found that Δresponse time significantly differed between the groups (t = 2.27, P = .03, d = 0.703). On average, fatigued participants had a Δresponse time of 2.69 (SD = 4.94) ms, while non-fatigued participants had an average Δresponse time of −0.96 (SD = 5.5) ms. Δerrors and Δcorrect did not show a statistically significant difference between the groups (t = 0.81, P = .42 and t = -1.91, P = .06, respectively).
To analyze the temporal progression of participants' performance during a cFAST session, we performed a series of paired t-tests. Figure 8 on the left depicts the average normalized response time in the three thirds of the session for non-fatigued pwMS, while Figure 8 on the right shows the results for pwMS with cognitive fatigue. For the group with no fatigue, the results are largely flat, with a slight trend toward improvement over time, while for the fatigued group, we see a significant increase in response time (P = .02) between the first and last third of the session.

cFAST relationship to disability
The disabled group has a mean EDSS of 3.26 (SD = 1.54) and the non-disabled group has a mean EDSS of 0.54 (SD = 0.67). Detailed demographics of these groups are described in Table S1 in the Supplements. Table 5 shows a complete overview of the metrics' average value and standard deviation for both groups, as well as the t-test results. Table S3 in the Supplements includes the non-parametric testing results using the Mann-Whitney U test.
We found a significant difference in response time between the groups (t = 2.47, P = .02, d = 0.844). Participants without disability had an average response
We performed the same analysis with the fatigability metrics. Δcorrect, Δresponse time, and Δerrors showed no statistically significant difference between the non-disabled and disabled groups (t = -0.3, P = .77; t = 1.98, P = .33; and t = -0.6, P = .55, respectively).

Predictive power of the cFAST metrics to classify cognitive fatigue
To further explore the association between cognitive fatigability and perceived fatigue, we assessed the predictive power of our metrics to classify participants as cognitively fatigued according to the FSMC cognitive subscale. Table 6 shows the mean AUROC of each metric with its confidence interval. The results indicate that the fatigability metrics are the best features for classifying fatigue independently of the EDSS.
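The AUROC of a single metric can be read as the probability that a randomly chosen fatigued participant has a higher metric value than a randomly chosen non-fatigued participant. The sketch below computes this quantity directly and averages it over random subsamples. The subsampling scheme (80% of each class per round, percentile interval) is an assumption made for illustration; it is not the exact Monte Carlo and stratification procedure behind Table 6, and both function names are hypothetical.

```python
import random
from statistics import mean

def auroc(scores, labels):
    """Probability that a positive case scores higher than a negative one (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def monte_carlo_auroc(scores, labels, n_rounds=1000, fraction=0.8, seed=0):
    """Mean AUROC and 2.5/97.5 percentiles over random class-stratified subsamples."""
    rng = random.Random(seed)
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # Assumed scheme: keep the same share of each class in every round.
    k_pos = max(1, round(fraction * len(pos)))
    k_neg = max(1, round(fraction * len(neg)))
    values = []
    for _ in range(n_rounds):
        sub_pos = rng.sample(pos, k_pos)
        sub_neg = rng.sample(neg, k_neg)
        values.append(auroc(sub_pos + sub_neg, [1] * k_pos + [0] * k_neg))
    values.sort()
    lo = values[int(0.025 * (n_rounds - 1))]
    hi = values[int(0.975 * (n_rounds - 1))]
    return mean(values), (lo, hi)
```

An AUROC of 0.5 corresponds to chance-level classification and 1.0 to perfect separation of the two groups by the metric.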

Predictive power of the cFAST metrics to classify disability
To evaluate the best cFAST metrics to classify disability independently of fatigue, we performed the same analysis as we did for cognitive fatigue. Results suggest that the general metrics are better than the fatigability metrics for disability in terms of AUROC. A complete overview of these results is shown in Table 7. Response time produced an average AUROC of 0.64 (95% CI 0.50-0.78), followed by age with an average AUROC of 0.63.

Differences in predictive power between the best fatigue and disability metrics
Figure 9 on the left shows the ROC curves corresponding to the FSMC classification for Δresponse time, the best-performing feature to classify cognitive fatigue, and response time, the best-performing feature to classify disability. Δresponse time outperforms response time by 11 percentage points in classifying fatigue according to the FSMC. The center of the figure shows boxplots of Δresponse time for the fatigued and non-fatigued groups, as well as the t-test results. The plot displays the statistically significant difference between the fatigued and non-fatigued groups (t = 2.27, P = .03). Similarly, the right side displays the boxplots corresponding to response time. There is a statistically significant difference between the groups (t = 2.16, P = .04). The difference remains significant without the outlier in the fatigued group.
We conducted a one-way analysis of covariance (ANCOVA) to examine whether response time differed between the fatigued and non-fatigued groups when controlling for EDSS. For this analysis, we removed the outlier in response time in the fatigued group, as it violated the normality assumption of ANCOVA. We verified the test assumptions: the Shapiro-Wilk test indicates that the data are normally distributed for the non-fatigued group, W(19) = .926 (P = .15), but not for the fatigued group, W(22) = .899 (P = .03). However, as the distribution is close to normal and ANCOVA is robust to this assumption violation, no further steps were taken. Visual analysis with a scatter plot indicates similar regression slopes, and an F test indicates no interaction between EDSS and fatigue group, F(1,37) = .24 (P = .64). Finally, Levene's test confirms the homogeneity of variance with F(1,39) = 1.27 (P = .27). The ANCOVA reveals that, after controlling for EDSS (disability), there was no significant difference in response time between the fatigue groups, F(1,38) = 1.42, P = .24. For a similar analysis on correct, refer to the Supplements.
Figure 10 shows data corresponding to disability classification according to the EDSS threshold. The left side of Figure 10 shows the ROC curves corresponding to the disability classification for Δresponse time, the best-performing feature to classify cognitive fatigue, and response time, the best-performing feature to classify disability. In this case, response time outperforms Δresponse time by 14 percentage points.
Figure 8. Average normalized response time during the three thirds of the cFAST session data after preprocessing for non-fatigued pwMS (left) and fatigued pwMS (right). A significant increase in the response time between the first and the last third of the task is present for fatigued patients only. The thirds were compared using a paired t-test. Note. cFAST, cognitive fatigability assessment test; pwMS, people with multiple sclerosis.
Table 4. Metrics comparison between fatigued and non-fatigued patients with mean (SD), independent samples t-test (two-tailed) to assess whether there is a statistically significant difference between the groups, and Cohen's d effect size.
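The ANCOVA comparison of response time between the fatigue groups, adjusted for EDSS, amounts to comparing two regression models: one with a common slope and a separate intercept per group, and one with a single line that ignores group membership. The following is a minimal sketch of that comparison for one covariate; the function names and the toy data are hypothetical, and this is not the statistical software used in the study. With 41 participants in two groups, the error degrees of freedom are 41 − 2 − 1 = 38, consistent with the reported F(1,38).

```python
from statistics import mean

def _sums(x, y):
    """Centered sums of squares and cross-products for one sample."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    syy = sum((b - my) ** 2 for b in y)
    return sxx, sxy, syy

def ancova_group_effect(groups):
    """F test for a group effect on y after adjusting for one covariate x.

    `groups` is a list of (x_values, y_values) pairs, one pair per group.
    Returns (F, df_between, df_error).
    """
    # Full model: separate intercept per group, one common slope.
    within = [_sums(x, y) for x, y in groups]
    sxx_w = sum(s[0] for s in within)
    sxy_w = sum(s[1] for s in within)
    syy_w = sum(s[2] for s in within)
    sse_full = syy_w - sxy_w ** 2 / sxx_w

    # Reduced model: a single regression line that ignores group membership.
    all_x = [a for x, _ in groups for a in x]
    all_y = [b for _, y in groups for b in y]
    sxx_t, sxy_t, syy_t = _sums(all_x, all_y)
    sse_reduced = syy_t - sxy_t ** 2 / sxx_t

    df_between = len(groups) - 1
    df_error = len(all_x) - len(groups) - 1
    f_stat = ((sse_reduced - sse_full) / df_between) / (sse_full / df_error)
    return f_stat, df_between, df_error

# Toy data: group B has higher raw y values only because its covariate is higher.
group_a = ([0, 1, 2, 3], [0.1, 0.9, 2.1, 2.9])
group_b = ([2, 3, 4, 5], [2.1, 2.9, 4.1, 4.9])
print(ancova_group_effect([group_a, group_b]))
```

In the toy data, the raw group means differ, but the F statistic is small once the covariate is accounted for, which mirrors the situation reported for response time and EDSS.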

Discussion
We described the development process and pilot study of a new test (cFAST) for cognitive fatigability. Our results provide early evidence that the cFAST measurement could be useful to identify patients with cognitive fatigue, as assessed by the FSMC cognitive subscale. So far, only a few studies have assessed cognitive fatigability with specific tasks in pwMS. 39,40 Moreover, previous results are contradictory, with some showing fatigability and others not. 39,40 Cognitive fatigability studies are in their infancy, and research could benefit from new approaches and validation studies. Our approach differs from previous methods in that it is tailored to patients' disabilities through a calibration mechanism that also enforces rapid decision-making, which we believe contributes to eliciting cognitive fatigability within a single test session and in a short period. In addition, our smartphone-based test is easy to administer, portable, and designed to be applied outside clinical settings, potentially allowing for remote and frequent monitoring. Concerning cognitive testing, healthy controls and pwMS perceive the PASAT as unpleasant and less likable, while the SDMT is preferred and found appropriate for cognitive testing. 46 Thus, we believe cFAST will have good acceptance, as it follows a similar logic to the SDMT and does not require patients to perform arithmetic operations under pressure like the PASAT. However, user acceptance of cFAST needs to be assessed in future studies.
Table 5. Metrics comparison between disabled and not disabled patients with mean (SD), independent samples t-test (two-tailed) to assess whether there is a statistically significant difference between the groups, and Cohen's d effect size.

Fatigability metrics relate to fatigue, while general metrics relate to disability
We derived two sets of metrics from cFAST: fatigability and general metrics. Our initial group-level analysis with a t-test revealed statistically significant differences between fatigued and non-fatigued patients in several general and fatigability metrics. Overall, we found more significant differences between the groups with the general metrics than with the fatigability metrics. However, results from the ANCOVA revealed that EDSS is associated with the metrics response time and correct. After controlling for EDSS, the statistical difference between the fatigue groups in these metrics disappears, which suggests that it is attributable to disability rather than fatigue. We further analyzed how the groups' differences related to patients' disabilities. To this end, we divided our study population into two groups according to EDSS: disabled (EDSS > 1.5) and non-disabled (EDSS ≤ 1.5). This grouping revealed statistically significant differences with the general metrics but not with the fatigability metrics. This result suggests that general metrics are related to and confounded by disability, while this is not true for the fatigability metrics. We conducted the AUROC analysis controlling for disability with Monte Carlo simulations and stratified splits to further rule out the effect of disability from the fatigue analysis. These results support our hypothesis that fatigability metrics are better predictors of fatigue than general metrics. Δresponse time, the best-performing metric to classify fatigue (with an average AUROC of 0.74), is 11 percentage points above response time, the best-performing general metric for fatigue. Conversely, general metrics dominate the disability classification, with response time being the best metric (average AUROC of 0.64), 9 percentage points above the best fatigability metric, Δerrors.
Analysis of the fatigability metrics revealed that, on average, performance during the test tends to worsen for fatigued patients, while patients without fatigue tend to improve. Previous work on fatigability showed that pwMS declined towards the end of sustained cognitive activity, while controls did not. 20,44 Our findings are in line with these results. However, our analysis focused only on pwMS to reduce disease-specific confounders.

Consideration for remote and unsupervised monitoring
We designed and implemented cFAST to enable remote monitoring. Hence, cFAST seeks to be self-explanatory. For instance, trials aim at familiarizing the users with the core test logic of matching numbers to symbols following the shown mapping rule. Thanks to the feedback displayed after every answer, users can quickly realize when they are making mistakes. The immediate feedback, together with the requirement of at least 70% correct answers out of a minimum of 20, helps us determine whether the user has correctly understood the test logic and the requirement to perform it quickly. As described in the methods section, we derived the pace of cFAST, the calibrated rate, from the calibration phase. The speed requirement seeks to induce cognitive fatigability in a short period. The calibrated rate is derived for each patient, personalizing the test and adjusting for patients' differing levels of disability and baseline performance.
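The comprehension requirement described above can be expressed as a simple check. This is a sketch based on one reading of the text, namely that a user must give at least 20 answers and that at least 70% of all answers given must be correct; the function and constant names are hypothetical and do not come from the cFAST implementation.

```python
MIN_ANSWERS = 20          # minimum number of answers (from the text)
MIN_ACCURACY_PERCENT = 70 # required share of correct answers (from the text)

def comprehension_check_passed(n_correct, n_answers):
    """True if enough answers were given and enough of them were correct."""
    if n_answers < MIN_ANSWERS:
        return False
    # Integer comparison avoids floating-point rounding at exactly 70%.
    return 100 * n_correct >= MIN_ACCURACY_PERCENT * n_answers

print(comprehension_check_passed(14, 20))  # exactly 70% of 20 answers
print(comprehension_check_passed(13, 20))  # 65% of 20 answers
```

A check of this kind gives an automatic signal that the test logic was understood, which matters when no examiner is present.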

Implications of objective measurement of cognitive fatigue
A reliable and objective measurement of fatigability would help quantify the effectiveness of treatments, both in clinical trials and routine care, and it would also help clinicians distinguish between confounding comorbidities. Several randomized placebo-controlled clinical trials tested different compounds for treating fatigue. 11-18,67,68 However, results from these clinical trials are inconsistent. A common denominator in these trials is that they quantified the outcome measure using subjective questionnaires. It is known that the magnitude of the placebo effect is an important reason for the variability in the efficacy during trials. 16,69,70 Thus, an objective measurement would help clinicians overcome these limitations and complement questionnaires to evaluate treatments' efficacy. Our evaluation shows that cFAST is a promising tool for quantifying cognitive fatigue. Additionally, we believe the design of cFAST will allow remote and unsupervised monitoring, enabling more frequent assessments and detailed fatigue profiles of patients while reducing the cost associated with medical personnel and specialized equipment.

Limitations and future work
A limitation of our study is the lack of a gold standard cognitive fatigability assessment to validate our approach. Currently, there is no established method to quantify cognitive fatigability. Until now, existing research has used cognitive tests protracted for extended periods in an attempt to induce and quantify fatigability. However, these approaches tend to be long, tedious, and costly. Moreover, results from these experiments are inconclusive. Hence, we directly compared our metrics to a widely accepted and validated fatigue questionnaire within MS research, the FSMC. The FSMC has the advantage of offering a subscale to evaluate cognitive fatigue independently of physical fatigue. Another limitation of our study is our sample size, limited to 42 study participants. We are aware that more extensive evaluations are needed to determine whether the test can be established as a surrogate for perceived cognitive fatigue for clinical decision-making. In particular, our pilot study uses a cross-sectional design; thus, we are not able to define the clinical significance of changes in the fatigability scores of individual patients. Future studies are needed to address this question. Finally, we designed cFAST to be suitable for remote and unsupervised monitoring. However, in this study, the evaluation was conducted within the hospital in a controlled environment. Further studies, including longitudinal outside-the-hospital evaluations in larger MS cohorts and within-subject comparisons, are needed to confirm the results. Nevertheless, we believe our study offers a detailed evaluation of our newly developed cognitive fatigability test.
As part of future work and prior to clinical implementation, more data have to be generated to further evaluate the generalization of the adaptation phase. Additionally, our study highlights the need for changes to improve data quality in an unsupervised setting. First, we recommend incorporating a statement in the cFAST instructions about the importance of conducting the test in a distraction-free environment (e.g. activating the "do not disturb" mode or using a quiet room). Second, we recommend automatically dismissing test sessions if no input is recorded within a certain period after the start. Distractions in uncontrolled environments (e.g. incoming phone calls or messages) can result in empty test sessions or significant periods without data, thus producing erroneous values for the proposed set of metrics. Moreover, future studies should examine whether cFAST could help clinicians distinguish between confounders such as depression or sleepiness. Finally, we need to investigate further how often patients need to repeat the calibration phase in unsupervised settings. We believe that calibration has to be performed only once and that the calibrated rate can be recomputed, if necessary, directly from the patient's existing cFAST sessions. Nonetheless, this requires further studies, including longitudinal data.
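The recommendation to dismiss sessions without timely input could be implemented along the following lines. This is an illustrative sketch only: the thresholds of 30 s and 15 s are hypothetical placeholders rather than values proposed in this study, the function name is an assumption, and the additional check for long pauses between answers is an extension motivated by the concern about significant periods without data.

```python
def session_is_usable(answer_times_s, max_start_delay_s=30.0, max_gap_s=15.0):
    """Decide whether a session should be kept for metric computation.

    answer_times_s: ascending times (seconds since session start) at which
    each answer was recorded. Threshold defaults are hypothetical.
    """
    # Empty session: no input was recorded at all.
    if not answer_times_s:
        return False
    # No input within the allowed period after the start.
    if answer_times_s[0] > max_start_delay_s:
        return False
    # Long pauses between answers suggest an interruption (e.g. a phone call).
    gaps = (b - a for a, b in zip(answer_times_s, answer_times_s[1:]))
    return all(gap <= max_gap_s for gap in gaps)

print(session_is_usable([2.0, 4.5, 6.0]))   # regular answering
print(session_is_usable([2.0, 30.0]))       # 28 s pause between answers
```

Appropriate thresholds would need to be chosen from data, for example relative to each patient's calibrated rate.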

Conclusions
We introduced cFAST, a novel smartphone-based test to quantify cognitive fatigability, tailored to the user's disability through its calibration mechanism. With cFAST, we aim to provide an objective surrogate of fatigue that allows monitoring of individual patients over time in uncontrolled environments (e.g. at home). We do not aim to provide a diagnostic tool, but rather a solution that helps clinicians make informed and timely decisions as to whether a patient's condition is improving or deteriorating and act accordingly. Results from our pilot study provide evidence supporting the validity of our approach, show that the fatigability metrics could potentially be used as a surrogate for perceived cognitive fatigue, and motivate further research in this area.